
Data Engineer Resume


Charlotte, NC

SUMMARY

  • 7 years of experience in data analysis and statistical modeling, including data extraction, manipulation, visualization, and validation, with reporting across a variety of projects.
  • Extensive experience in text analytics, developing statistical, machine learning, and data mining solutions to business problems and generating data visualizations using R, Python, and Tableau.
  • Skilled in Advanced Regression Modeling, Time Series Analysis, Statistical Testing, Correlation, Multivariate Analysis, Forecasting, Model Building, Business Intelligence tools and application of Statistical Concepts.
  • Proficient in: Data Acquisition, Storage, Analysis, Integration, Predictive Modeling, Logistic Regression, Decision Trees, Data Mining Methods, Forecasting, Factor Analysis, Cluster Analysis, ANOVA, Neural Networks, and other advanced statistical and econometric techniques.
  • Adept in writing code in Python and R to manipulate data for data loads and extracts.
  • Proficient in data entry, data auditing, creating data reports & monitoring data for accuracy.
  • Skilled in web search and data collection, web data mining, extracting data from websites, data entry, and data processing.
  • Extensive hands-on experience with major statistical analysis tools such as Python, R, SQL, and MATLAB.
  • Strong knowledge of all phases of the SDLC (Software Development Life Cycle), from analysis and design through development, testing, implementation, and maintenance, with timely delivery against deadlines.
  • Good knowledge and understanding of data mining techniques such as classification, clustering, regression, and random forests.
  • Extensive experience creating MapReduce jobs, running SQL on Hadoop with Hive, building ETL with Pig scripts, and using Flume to transfer unstructured data to HDFS.
  • Strong ability to work with Mahout for applying machine learning techniques in the Hadoop ecosystem.
  • Strong Oracle/SQL Server programming skills, with experience working with functions, packages, and triggers.
  • Configured aggregation of CloudTrail logs across AWS accounts and regions into a single S3 bucket to perform security analysis, track changes to AWS resources, troubleshoot operational issues, and demonstrate compliance with internal policies and regulatory standards (see the sketch after this list).
  • Strong expertise with Hadoop and Spark, and with data tools such as PySpark, Pig, Hive, Kafka, and Flume.
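
Illustrative sketch (not project code): a minimal boto3 example of the multi-region CloudTrail aggregation described above. The trail name, bucket, and region are hypothetical placeholders, and cross-account delivery additionally requires a bucket policy granting CloudTrail write access.

# Minimal sketch; trail name, bucket, and region are placeholders.
import boto3

cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

# Create one trail that records events from all regions of the account
# and delivers logs to a single central S3 bucket for security analysis.
cloudtrail.create_trail(
    Name="org-wide-trail",                 # hypothetical trail name
    S3BucketName="org-cloudtrail-logs",    # hypothetical central bucket
    IsMultiRegionTrail=True,               # aggregate across regions
    IncludeGlobalServiceEvents=True,       # capture IAM/STS and other global events
)

# Start logging on the new trail.
cloudtrail.start_logging(Name="org-wide-trail")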

TECHNICAL SKILLS

Languages: R, SQL, Python, Shell scripting, Java, Scala, C++.

IDE: R Studio, Jupyter Notebook, Zeppelin, Eclipse, NetBeans, Atom.

Databases: Oracle 11g, SQL Server, MS Access, MySQL, MongoDB, Cassandra, PL/SQL, T-SQL, ETL.

Big Data Ecosystems: Hadoop, MapReduce, HDFS, HBase, Hive, Pig, Impala, Kafka, Spark MLlib, PySpark, Sqoop, Avro.

Operating Systems: Windows XP/7/8/10, Ubuntu, Unix, Linux

Packages: ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, wordcloud, kernlab, neuralnet, twitteR, NLP, reshape2, rjson, plyr, pandas, NumPy, seaborn, SciPy, matplotlib, scikit-learn, Beautiful Soup, rpy2, TensorFlow, PyTorch, CNN, RNN, XGBoost

Web Technologies: HTML, CSS, PHP, JavaScript

Data Analytics Tools: R console, Python (NumPy, pandas, scikit-learn, SciPy), SPSS.

BI and Visualization: Tableau, SSAS, SSRS, Informatica, QlikView, Clarabridge

Version Control: Git, SVN

PROFESSIONAL EXPERIENCE

Confidential, Charlotte, NC

DATA ENGINEER

Responsibilities:

  • Worked as a Data Engineer to review business requirements and compose source-to-target data mapping documents.
  • Worked within an Agile development methodology as an active member in scrum meetings.
  • Performed data profiling and merged data from multiple data sources.
  • Performed big data requirement analysis and designed and developed solutions for ETL and business intelligence platforms.
  • Loaded data from HDFS into Spark RDDs to run predictive analytics on the data.
  • Used HiveContext, which provides a superset of the functionality of SQLContext, and preferred writing queries with the HiveQL parser to read data from Hive tables (fact, syndicate).
  • Modeled Hive partitions extensively for data separation and faster data processing, and followed Hive best practices for tuning.
  • Developed Spark scripts by writing custom RDDs in Scala for data transformations and performed actions on RDDs.
  • Created Hive fact tables on top of raw data from different retailers, partitioned by time dimension key, retailer name, and data supplier name, which were further processed and pulled by the analytics service engine.
  • Developed highly complex Python and Scala code that is maintainable, easy to use, and satisfies application requirements for data processing and analytics using built-in libraries.
  • Designed and optimized Spark SQL queries and DataFrames: imported data from data sources, performed transformations and read/write operations, and saved the results to an output directory in AWS S3.
  • Responsible for building scalable, distributed data solutions on Amazon EMR clusters.
  • Developed Spark RDD transformations, actions, DataFrames, case classes, and Datasets for the required input data and performed the data transformations using Spark Core.
  • Created data pipelines for the Kafka cluster and processed the data with Spark Streaming, consuming data from Kafka topics and loading it to the landing area for near-real-time reporting.
  • Worked with cloud-based technologies such as Redshift, S3, EC2, and other AWS services; extracted data from Oracle Financials and the Redshift database, and created AWS Glue jobs to load incremental data to S3 staging and persistence areas.
  • Migrated an in-house database to the AWS cloud and designed, built, and deployed a multitude of applications on the AWS stack (including S3, EC2, RDS), focusing on high availability and auto-scaling.
  • Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
  • Parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark, and created Hive DDL on Parquet and Avro data files residing in both HDFS and S3 (a minimal sketch follows this list).
  • Created AWS Glue job for archiving data from Redshift tables to S3 (online to cold storage) as per data retention requirements.
  • Updated Python scripts to match training data with our database stored in AWS CloudSearch, so that each document could be assigned a response label for further classification.
  • Created monitors, alarms, notifications, and logs for Lambda functions, Glue jobs, and EC2 hosts using CloudWatch, and used AWS Glue for data transformation, validation, and cleansing.
  • Deployed applications using Jenkins, integrating Git version control into the pipeline.
  • Improved the performance of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and YARN.
  • Scheduled Airflow DAGs to run multiple Hive and Pig jobs that run independently based on time and data availability, and performed exploratory data analysis and data visualization using Python and Tableau.
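
Illustrative sketch (not project code): a minimal PySpark example of the JSON-to-Parquet conversion and partitioned S3 write described above. The S3 paths, column names, and partition columns are hypothetical placeholders.

# Minimal PySpark sketch; paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Read semi-structured JSON from the landing area into a DataFrame.
raw_df = spark.read.json("s3://my-landing-bucket/retailer-feeds/")  # placeholder path

# Light cleanup: drop fully empty records, standardize a column name (assumed to
# exist in the feed), and stamp a load date used for partitioning.
clean_df = (raw_df
    .dropna(how="all")
    .withColumnRenamed("retailerName", "retailer_name")  # assumed feed column
    .withColumn("load_date", current_date()))

# Write as Parquet, partitioned so downstream Hive/Athena queries can prune
# by retailer and load date (mirrors the fact-table partitioning above).
(clean_df.write
    .mode("overwrite")
    .partitionBy("retailer_name", "load_date")
    .parquet("s3://my-curated-bucket/retailer-parquet/"))  # placeholder path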

Environment: Python, SQL, AWS, Hadoop, Hive, MapReduce, Scala, Spark, Kafka, AWS S3, AWS Glue, Redshift, RDS, Lambda, Athena, PySpark, Teradata, Tableau.

Confidential, New York, NY

DATA ENGINEER

Responsibilities:

  • Worked to rapidly evaluate, create, and test new use cases for the organization in a quick-paced Agile development environment.
  • Able to parse, manipulate, and transform data in Python to and from a broad variety of formats (CSV, JSON, XML, HTML, etc.).
  • Maintaining existing ETL workflows, data management and data query components.
  • Worked in Azure environment for development & deployment of Custom Hadoop Applications.
  • Worked closely with development, test, documentation, and product management teams to deliver high-quality products and services in a fast-paced environment.
  • Collecting, aggregating, and moving data from servers to HDFS using Apache Flume.
  • Created Hive tables, loaded them with data, and wrote Hive queries that run internally as MapReduce jobs.
  • Migrated ETL jobs to Pig scripts to perform transformations, joins, and pre-aggregations before storing the data in HDFS.
  • Involved in installation and configuration of Cloudera Distribution Hadoop platform.
  • Extract, transform, and load (ETL) data from multiple federated data sources (JSON, relational database, etc.) with Data Frames in Spark.
  • Utilized SparkSQL to extract and process data by parsing using Datasets or RDDs in HiveContext, with transformations and actions (map, flatMap, filter, reduce, reduceByKey).
  • Extended the capabilities of DataFrames using user-defined functions in Python and Scala (a minimal PySpark sketch follows this list).
  • Interacted with the Spark shell using the Python API (PySpark).
  • Developed Spark programs using Scala APIs to compare the performance of Spark with Hive and SQL.
  • Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
  • Migrated data into the RV data pipeline using Databricks, Spark SQL, and Scala.
  • Used Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
  • Implemented Spark using Scala and SparkSQL for faster testing and processing of data.
  • Designed and created Hive external tables using shared meta-store instead of derby with partitioning, dynamic partitioning, and buckets.
  • Used Python libraries like NumPy & Pandas in conjunction with Spark in dealing with DataFrames.
  • Performed real-time integration and loaded data from the Azure Data Box, mounting it via FUSE for bulk loads.
  • Creating Spark clusters and configuring high concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
  • Extensively used Azure Data Lake and worked with different file formats csv, parquet, delta.
  • Created procedures in Azure SQL Data Warehouse and built the final aggregate tables for dashboards.
  • Worked with Flume for collecting, aggregating, and moving large amounts of log data as well as for streaming log data.
  • Used Kafka to build real-time data pipelines and streaming applications: publishing and subscribing to message queues (topics), storing streams of records in a fault-tolerant, durable way, and processing streams of records as they occur.
  • Worked with Spark Streaming for streaming real time data using DStreams.
  • Good knowledge of Scala/Java development.
  • Developed custom tools, scripts, and packaging solutions for AEM using Java/Unix.
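
Illustrative sketch (not project code): a minimal PySpark example of extending a DataFrame with a Python user-defined function, as mentioned above. The column name and normalization rule are hypothetical.

# Minimal UDF sketch; the "device_code" column and rule are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

df = spark.createDataFrame(
    [("ab-101 ",), ("CD-202",), (None,)],
    ["device_code"],
)

def normalize_code(code):
    # Trim whitespace and upper-case the code; pass nulls through unchanged.
    return code.strip().upper() if code is not None else None

# Register the Python function as a UDF so it can be applied column-wise.
normalize_code_udf = udf(normalize_code, StringType())

df.withColumn("device_code_norm", normalize_code_udf(df["device_code"])).show()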

Environment: Hadoop 2.7.7, HDFS 2.7.7, Spark 2.1, MapReduce 2.9.1, Hive 2.3, Sqoop 1.4.7, Kafka 0.8.2.x, HBase, Oozie, Flume 1.8.0, Scala 2.12.8, Python 3.7, Java 8, JSON, SQL scripting, Linux shell scripting, Avro, Parquet, Azure, Azure Databricks, Hortonworks, Cloudera.

Confidential, Kansas

BIG DATA DEVELOPER

Responsibilities:

  • Performed batch and real-time processing of data using Hadoop components such as Hive and Spark.
  • Monitored the Hadoop cluster using Cloudera Manager, interacted with Cloudera support, logged issues in the Cloudera portal, and fixed them per the recommendations.
  • Continuously monitored and managed the Hadoop cluster using Cloudera Manager and the web UI.
  • Used Spark Streaming to process streaming data and analyze continuous datasets with PySpark.
  • Resolved complex issues in Azure Databricks and HDInsight reported by Azure end customers.
  • Reproduced issues reported by Azure end customers and helped debug them.
  • Built various HDInsight clusters (Hive, Spark, HBase, Kafka, and LLAP Interactive Query) with the Enterprise Security Package and virtual networks.
  • Debugged and troubleshot cluster scaling issues.
  • Reduced server crashes under heavy website traffic by spinning up EC2 instances integrated with ELB and auto-scaling.
  • Used AWS SageMaker to quickly build, train, and deploy machine learning models (a minimal sketch follows this list).
  • Implemented a generalized solution model using AWS SageMaker.
  • Saved time by synchronizing files efficiently and securely from on-premises and in-cloud file systems to Amazon EFS.
  • Built an AWS CI/CD data pipeline and an AWS data lake using EC2, AWS Glue, and AWS Lambda.
  • Created a continuous integration and continuous delivery (CI/CD) pipeline on AWS to automate steps in the software delivery process.
  • Created and managed EC2, S3, Identity and Access Management (IAM), and other AWS services.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
  • Designed and developed ETL jobs to extract data from a Salesforce replica and load it into a data mart in Redshift.
  • Migrated an on-premises database structure to the Confidential Redshift data warehouse and was responsible for setting up ETL and data validation using SQL Server Integration Services.
  • Provided technical knowledge on recurring customer-identified production problems and helped find bugs from issues reported by end customers.
  • Worked on customer-reported issues and provided fixes.
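
Illustrative sketch (not project code): a minimal example of building, training, and deploying a model with the SageMaker Python SDK, as referenced above. The container image URI, IAM role ARN, and S3 paths are placeholders.

# Minimal SageMaker sketch; image URI, role ARN, and S3 paths are placeholders.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="<training-image-uri>",                     # placeholder container
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-model-bucket/output/",           # placeholder output path
    sagemaker_session=session,
)

# Train against data staged in S3, then deploy behind a real-time endpoint.
estimator.fit({"train": "s3://my-model-bucket/train/"})   # placeholder channel
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")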

Environment: Java, Eclipse, Hadoop, MapReduce, HDFS, Sqoop, Oozie, WinSCP, UNIX shell scripting, Hive, Impala, Cloudera (Hadoop distribution), AWS, SageMaker, JIRA, etc.

Confidential, Dallas

DATA ANALYST

Responsibilities:

  • Built a framework using UNIX and Python to validate data migrated from Teradata to Snowflake (a minimal sketch follows this list).
  • Managed reports to identify load delays in daily-updated production tables.
  • Maintained reports to identify data issues between Snowflake and Teradata, automated to produce daily and weekly reports for support teams.
  • Created, deployed, and granted access to databases on SQL servers.
  • Wrote optimized SQL queries to pull required data for further reporting.
  • Validated production data in the Snowflake platform while shifting operations from AWS East to AWS West, to ensure compliance with government policies.
  • Involved in spinning up EC2 instances and adherence to rehydration policies/maintenance.
  • Worked extensively with AWS services such as SageMaker, Lambda, Lex, EMR, S3, Redshift, and QuickSight.
  • Developed an intent-based chatbot on AWS Lex and wrote a serverless Lambda function to invoke the model endpoint deployed on SageMaker.
  • Coordinated with offshore support teams to report data inconsistencies.
  • Performed data manipulation and data validation to build the reporting structure.
  • Served as a first responder to data quality issues on Slack channels and other platforms.
  • Automated CI/CD and orchestrated deployments to various environments using GitLab CI/CD pipelines.
  • Optimized, troubleshot, and integrated test cases into the CI/CD pipeline using Docker images.
  • Designed and developed Ad-hoc reports as per business analyst, operation analyst, and project manager data requests.
  • Designed and developed various analytical reports from multiple data sources using Tableau Desktop.
  • Developed Tableau dashboards to share insights on how delays affect performance, with recommendations for customers.
  • Created source-to-target data mapping documents of input/output attributes with the proper transformations to ease development of the Informatica code.
  • Documented all data mapping and transformation processes in the functional design documents based on the business requirements.
  • Built a drill-down Tableau dashboard to give customers a brief and detailed overview of card tables.
  • Used VLOOKUP and pivot table functions in Excel.
  • Loaded data into staging tables via views.
  • Designed and developed weekly and monthly reports related to the financial partnership using Teradata SQL.
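
Illustrative sketch (not project code): a minimal row-count reconciliation of the kind the Teradata-to-Snowflake validation framework performed. Connection parameters and the table list are placeholders, and a real framework would cover more checks than counts.

# Minimal validation sketch; connection details and table names are placeholders.
import snowflake.connector
import teradatasql

TABLES = ["SALES.DAILY_TXN", "SALES.CUSTOMER"]   # placeholder table names

sf_conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***", warehouse="MY_WH"
)
td_conn = teradatasql.connect(host="td-host", user="my_user", password="***")

def row_count(conn, table):
    # Run a simple COUNT(*) against the given connection and return the value.
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM " + table)
    count = cur.fetchone()[0]
    cur.close()
    return count

# Compare counts table by table and flag mismatches for the daily report.
for table in TABLES:
    td_count = row_count(td_conn, table)
    sf_count = row_count(sf_conn, table)
    status = "OK" if td_count == sf_count else "MISMATCH"
    print(f"{table}: teradata={td_count} snowflake={sf_count} -> {status}")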

Environment: Teradata, Snowflake, Tableau, AWS, Python, Data Mapping, Microsoft Office (Word, Access, Excel, Outlook)

Confidential

SQL (MSBI) DEVELOPER

Responsibilities:

  • Developed test cases and SQL test scripts based on the detailed data design, detailed functional design, and ETL specifications.
  • Worked closely with stakeholders to understand, define, and document the business questions that needed answering.
  • Reviewed system/application requirements (functional specifications), test results, and metrics for quality and completeness.
  • Analyzed source data coming from different sources (SQL Server, Oracle, and flat files such as Access and Excel) and worked with business users and developers to develop the model.
  • Worked on Predictive Modeling using SAS/SQL.
  • Performed Statistical Analysis and Hypothesis Testing in Excel by using Data Analysis Tool.
  • Interacted with other departments to understand and identify data needs and requirements, and worked with other members of the IT organization to deliver data visualization and reporting solutions addressing those needs.
  • Created customized business reports and shared insights with management.
  • Presented a dashboard to all stakeholders for a better understanding of the dataset.
  • Performed module specific configuration duties for implemented applications to include establishing role-based responsibilities for user access, administration, maintenance, and support.
  • Responsible for maintaining the integrity of the SQL database and reporting any issues to the database architect.
  • Designed and modeled the reporting data warehouse, considering current and future reporting requirements.
  • Performed daily maintenance of the database, monitoring the daily run of the scripts and troubleshooting any errors in the process.
  • Worked with statistical domain experts to understand the data and with the data management team on data quality assurance.
  • Developed scripts for comparison with the target, and planned and ran SIT (System Integration Testing) for the given LOB.

Environment: Informatica Power Center, HP-ALM, SharePoint, MS-Visio, MS-Excel, QlikView, SAP-BO, Oracle 11g, Microsoft SQL Server, Tableau report builder, MS Outlook, SQL Server 2012/2014, Python (Scikit-Learn, NumPy, Pandas, Matplotlib, Dateutil, Seaborn), Tableau, Hadoop.
