Data Engineer Resume
Columbus, Ohio
SUMMARY:
- Highly motivated, results-driven analyst with a proven track record in data science and analytics platforms.
- Experienced in applied mathematics, linear algebra, and probability and statistics.
- Experience with Python and PySpark on Cloudera distributions for data science projects.
- Good knowledge of software and programming, with emphasis on numerical algorithms, statistical modeling, mathematical theory, numerical libraries, and scientific computing.
- Excellent communication and interpersonal skills.
TECHNICAL SKILLS:
Platforms: Windows 7/8/10, Ubuntu
Products: Anaconda, RStudio, Tableau Desktop, Jupyter, MATLAB, Cloudera
Databases: Microsoft SQL Server, Hive
Programming Languages: C, Python, R, Java
Software: RStudio, Spyder, Microsoft SQL Server, Tableau Desktop, Java Platform, Turbo C, R Shiny, MATLAB and Simulink, Minitab 18, XLMiner, Sqoop, HDFS, Hive, Spark, Flume
PROFESSIONAL EXPERIENCE:
Confidential, Columbus, Ohio
Data Engineer
Technologies Leveraged: 8-node cluster with AtScale on the edge node in the cloud, Cloudera CDH 5.15, AtScale 7.2, Tableau 2018, Hive, Sqoop, Impala, Spark
Responsibilities:
- Installation and setup of a multi-node Cloudera cluster on the AWS cloud
- Installation and setup of AtScale on top of the Hadoop cluster, using Hive and Impala as the SQL engines (see the sketch after this list)
- Development of cubes involving multiple facts and dimensions
- Development of calculations, leveraging Query Data Sets
- Defining and managing aggregates
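A minimal sketch of querying the published cube layer through Impala from Python, assuming the impyla client; the host, port, and fact/dimension table names below are hypothetical:

from impala.dbapi import connect

# Connect to the Impala daemon on the edge node (hypothetical host/port).
conn = connect(host="edge-node.example.com", port=21050)
cur = conn.cursor()

# Aggregate over a fact table joined to a dimension; AtScale can route
# such a query to a pre-defined aggregate when one matches.
cur.execute("""
    SELECT d.region, SUM(f.sales_amount) AS total_sales
    FROM sales_fact f
    JOIN region_dim d ON f.region_id = d.region_id
    GROUP BY d.region
""")
for region, total in cur.fetchall():
    print(region, total)

cur.close()
conn.close()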
Confidential
Technologies Leveraged: Hadoop, Cloudera 5.15, Spark 2.1, HDFS, Python, MapReduce, YARN, Hive, Pig, Sqoop, Oozie, Java, Unix
Responsibilities:
- Developed a mapping from each data-ingestion type to the corresponding ingestion program (Sqoop, Pig, or Kafka).
- Developed ingestion capability using Sqoop, Kafka, and Pig; leveraged Spark for data processing and transformation.
- Developed the real-time / near-real-time framework using Kafka and Flume capabilities.
- Developed a framework for deciding on data formats such as Parquet, Avro, and ORC.
- Developed Spark code using Python and Spark SQL for faster processing and testing.
- Worked with Spark SQL to join multiple Hive tables, write the result to a final Hive table, and store it on S3 (see the sketch after this list).
- Implemented Spark RDD transformations to map business logic and applied actions on top of the transformations.
- Performed querying of both managed and external tables created in Hive.
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
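A minimal sketch of the Hive-join-to-S3 step in PySpark; the table, column, and bucket names are hypothetical:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-join-to-s3")
         .enableHiveSupport()      # expose the Hive metastore to Spark SQL
         .getOrCreate())

# Join two Hive tables with Spark SQL.
joined = spark.sql("""
    SELECT o.order_id, o.amount, c.customer_name
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
""")

# Persist the result as a final Hive table stored as Parquet on S3.
(joined.write
       .mode("overwrite")
       .format("parquet")
       .option("path", "s3a://my-bucket/warehouse/orders_enriched")
       .saveAsTable("orders_enriched"))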
Confidential
Technologies Leveraged: Hadoop, Cloudera, Spark, HBase, HDFS, Python, MapReduce, YARN, Hive, Pig, Sqoop, Oozie, Tableau.
Responsibilities:
- Performed advanced procedures such as text analytics and processing using the in-memory computing capabilities of Spark.
- Developed Spark code using Python and Spark SQL for faster processing and testing.
- Worked with Spark SQL to join multiple Hive tables, write the result to a final Hive table, and store it on S3.
- Implemented Spark RDD transformations to map business logic and applied actions on top of the transformations.
- Created Spark jobs to run lightning-fast analytics on the Spark cluster.
- Extracted files from Teradata through Sqoop, placed them in HDFS, and processed them.
- Responsible for storing processed data in HBase (see the sketch after this list).
- Performed querying of both managed and external tables created in Hive.
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Fetched and generated monthly reports and visualized them using Tableau.
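A minimal sketch of the HBase storage step, assuming the happybase client talking to an HBase Thrift gateway; the host, table, and column-family names are hypothetical:

import happybase

# Connect through the HBase Thrift server (hypothetical host).
connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("processed_data")

# Write each processed record under its row key, into a 'd' column family.
for row in [{"id": "r1", "score": "0.87"}, {"id": "r2", "score": "0.42"}]:
    table.put(row["id"].encode(), {b"d:score": row["score"].encode()})

connection.close()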
Confidential
Junior Data Scientist
Responsibilities:
- Created a CIBIL-style score for each retailer under the company to assess the retailer's creditworthiness.
- Preprocessed three years of raw, unstructured data into a structured format.
- Using multiple years of data and different factors, created a score for each retailer and built a naïve Bayes model of each retailer's worthiness for the upcoming years (see the sketch after this list).
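A minimal sketch of the naïve Bayes worthiness model using scikit-learn; the factor names, file name, and the rescaling to a 0-100 score are hypothetical:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("retailer_factors.csv")  # hypothetical structured data
X = df[["avg_monthly_sales", "payment_delay_days", "years_active"]]
y = df["worthy"]  # 1 = creditworthy, 0 = not

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = GaussianNB().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# The predicted probability of worthiness can be rescaled to a
# CIBIL-style 0-100 score per retailer.
df["score"] = model.predict_proba(X)[:, 1] * 100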
Environment: MS Office, R Studio, Tableau Desktop, R Shiny, Python 3.2.7
Confidential
Responsibilities:
- Created prediction models for whether a candidate would join the company.
- Used multivariate analysis to see how each factor affected the outcome, and used multiple linear regression to find the weight of each factor.
- Alongside the regression, built decision-tree and k-NN models and concluded that the decision tree was the best-fitting model for this data (see the sketch after this list).
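A minimal sketch of the model comparison with scikit-learn; the factors and file name are hypothetical:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("candidates.csv")
X = df[["offered_hike_pct", "notice_period_days", "interview_score"]]
y = df["joined"]  # 1 = joined, 0 = did not join

# Factor weights from a multiple linear regression on the 0/1 outcome.
weights = LinearRegression().fit(X, y).coef_
print(dict(zip(X.columns, weights)))

# Cross-validated accuracy of the decision-tree and k-NN classifiers.
for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("k-NN", KNeighborsClassifier(n_neighbors=5))]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())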
Environment: RStudio, R Shiny, MS Office, Tableau Desktop, Python 3.2.7
Confidential
Junior Data Scientist
Responsibilities:
- Created an algorithm using the TensorFlow module in Python in which the camera identifies the contours of the hand and, based on them, classifies the gesture being made (see the sketch after this list).
- Trained the model on more than 3,000 images so that it could learn to recognize each gesture.
- Integrated the algorithm on a Raspberry Pi so it could be used as required.
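A minimal sketch of the gesture pipeline: OpenCV finds the hand contour, and a trained TensorFlow model classifies the cropped region. The model file, input size, and gesture labels are hypothetical:

import cv2
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("gesture_cnn.h5")  # hypothetical model
labels = ["fist", "palm", "thumbs_up"]                # hypothetical classes

cap = cv2.VideoCapture(0)  # default camera (e.g. a Pi camera)
ok, frame = cap.read()
if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        # Crop around the largest contour, assumed to be the hand.
        x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
        roi = cv2.resize(gray[y:y + h, x:x + w], (64, 64)) / 255.0
        probs = model.predict(roi.reshape(1, 64, 64, 1))
        print("gesture:", labels[int(np.argmax(probs))])
cap.release()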
Environment: Python 3.2.7, OpenCV, TensorFlow, Pandas, Raspberry Pi 3
Confidential
Junior Data Scientist
Responsibilities:
- Created an algorithm to extract emails, addresses, skills, experience, and names from candidate resumes.
- Built the algorithm on the Python libraries textract and NLTK.
- Created a ranking based on each candidate's skills and years of experience: a score from 0-100 that gives priority to skills over experience (see the sketch after this list).
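A minimal sketch of the extraction and 0-100 scoring; the skill vocabulary and weighting scheme are hypothetical (NLTK's "punkt" tokenizer data is assumed to be installed):

import re
import textract
import nltk

# textract handles .docx, .pdf, etc.; it returns bytes.
text = textract.process("resume.docx").decode("utf-8")

# Email extraction with a regular expression.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

# Skill matching against a known vocabulary using NLTK tokens.
tokens = {t.lower() for t in nltk.word_tokenize(text)}
skills = tokens & {"python", "spark", "hive", "sql", "tableau"}

# Years of experience from phrases like "5 years".
years = max((int(m) for m in re.findall(r"(\d+)\s+years?", text)), default=0)

# 0-100 score: skills weighted more heavily than experience.
score = min(100, len(skills) * 15 + min(years, 10) * 2.5)
print(emails, sorted(skills), years, score)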
Environment: Python 3.2.7, textract, NLTK, docx2txt, Microsoft Office