Sr. Data Engineer Resume
Lake Success, NY
PROFESSIONAL SUMMARY:
- 8 years of professional experience in big data development, primarily using the Hadoop and Spark ecosystems.
- Data Engineer/Analyst with strong technical expertise, business experience, and communication skills to drive high-impact business outcomes through data-driven innovations and decisions.
- Good understanding of Google Cloud design considerations and limitations and their impact on pricing.
- Prior experience working with container technology such as Docker and version control systems such as GitHub.
- Experience hosting applications on GCP using Compute Engine, App Engine, and Cloud SQL.
- Delivered data mining solutions to various business problems and generated data visualizations using Python, R, and Tableau.
- Solid team player, team builder, and an effective communicator.
- Expertise in transforming business requirements into analytical models, designing algorithms, building models, developing data mining and reporting solutions that scale across a massive volume of structured and unstructured data.
- Advanced knowledge of designing visualizations in Looker and Alteryx and of publishing and presenting dashboards and storylines on web and desktop platforms.
- Hands-on experience with BigQuery and Bigtable, able to write complex SQL for data transformation activities, and experienced in SQL optimization.
- Worked with and extracted data from various database sources such as Oracle, SQL Server, DB2, and Teradata.
- Hands-on experience with machine learning models such as classification, regression, clustering, collaborative filtering, and dimensionality reduction.
- Hands-on experience creating Dataproc clusters and Dataproc Hub environments through the JupyterLab UI to support Spark jobs.
- Expertise in all aspects of the Software Development Lifecycle (SDLC): requirements analysis, design, development, testing, implementation, and maintenance.
- Used Airflow for orchestration and scheduling of ingestion scripts (see the DAG sketch after this list).
- Significant experience and high proficiency with structured, semi-structured, and unstructured data, using a broad range of data science programming languages and big data tools including R, Python, Apache Spark, SQL, and scikit-learn.
- Hands-on experience using Python and SQL for streaming and batch data processing.
- Hands-on experience with GCP tools such as Cloud Functions and Cloud SQL.
- Comprehensive experience working with Test-Driven Development and Agile Scrum.
- Performed advanced statistical analysis in Python/R (time series models, univariate and multivariate analysis of variance, PCA, survival analysis, regression modeling) and presented data summary tables and figures.
- Flexible with Unix/Linux and Windows environments, working with operating systems such as CentOS 5/6, Ubuntu 13/14, and Cosmos.
- Strong Excel skills including pivot table, VLOOKUP, charts, conditional formatting, data validation.
- Excellent presentation skills with the ability to explain data insights to non-experts; good collaboration skills for communicating with cross-functional teams.
- Good exposure to Python programming.
- Strong experience in the design and development of relational databases with multiple RDBMS platforms, including Oracle 10g, MySQL, and MS SQL Server, and in PL/SQL.
- Troubleshooting production incidents requiring detailed analysis of issues in web and desktop applications, AutoSys batch jobs, and databases.
- Experience in working with various SDLC methodologies like Waterfall, Agile Scrum, and TDD for developing and delivering applications.
- Strong troubleshooting and production support skills and interaction abilities with end users.
- Worked closely with technical teams, business teams, and product owners.
- Strong analytical and problem-solving skills and the ability to follow through with projects from inception to completion.
- Ability to work effectively in cross-functional team environments; excellent communication and interpersonal skills.
- Developed SQL queries to research, analyze, and troubleshoot data and to create business reports.
- Translate database reporting needs into powerful SQL queries that extract data and compile it into meaningful reports.
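A minimal sketch of the kind of Airflow DAG used to orchestrate and schedule ingestion scripts, as noted above; the DAG id, schedule, and script path are hypothetical placeholders.

```python
# Minimal Airflow DAG sketch: schedule a daily ingestion script.
# The DAG id, schedule, and script path are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_ingestion",           # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",      # run daily at 02:00
    default_args=default_args,
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="run_ingestion_script",
        bash_command="python /opt/scripts/ingest_sales.py",  # placeholder script
    )
```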
TECHNICAL SKILLS:
Hadoop/Big Data Technologies: HDFS, Apache NiFi, MapReduce, Sqoop, Flume, Pig, Hive, Oozie, Impala, Zookeeper, Ambari, Storm, Spark and Kafka
NoSQL Databases: HBase, Cassandra, MongoDB
Monitoring and Reporting: Tableau, Custom Shell Scripts
Hadoop Distribution: Hortonworks, Cloudera, MapR
Build and Deployment Tools: Maven, sbt, Git, SVN, Jenkins
Programming and Scripting: Scala, Java, Python, SQL, JavaScript, Shell Scripting, Pig Latin, HiveQL
Java Technologies: J2EE, Java Mail API, JDBC
Databases: Oracle, MySQL, MS SQL Server, Vertica, Teradata
Analytics Tools: Tableau, Microsoft SSIS, SSAS and SSRS
PROFESSIONAL EXPERIENCE:
Sr. Data Engineer
Confidential, Lake Success, NY
Responsibilities:
- Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.
- Experience with scripting languages such as PowerShell, Perl, and shell, and with version control through GitHub.
- Monitored BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver for all environments.
- Designed star schemas in BigQuery (see the query sketch after this list).
- Designed and architected the various layers of the data lake.
- Experience building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinating tasks among the team.
- Developed under Scrum methodology in a CI/CD environment using Jenkins.
- Participated in the architecture council for database architecture recommendations.
- Developed Spark code in Scala and Python for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows big data resources.
- Used Scala components to implement the credit line policy based on conditions applied to Spark DataFrames (see the DataFrame sketch after this list).
- Extracted, transformed, and loaded data from various sources to generate CSV data files using Python programming and SQL queries.
- Developed data processing applications in Scala using Spark RDDs as well as DataFrames through the Spark SQL API.
- Used pandas UDFs and Spark SQL array functions (array contains, distinct, flatten, map, sort, split, and overlaps) for filtering the data.
- Prototyped the analysis and joining of customer data using Spark in Scala and processed the results to HDFS.
- Designed and implemented Sqoop incremental jobs to read data from DB2 and load it into Hive tables, and connected Tableau to HiveServer2 for generating interactive reports.
- Built automated pipelines using Jenkins and Groovy scripts.
- Used shell commands to push environment and test files to AWS through automated Jenkins pipelines.
- Developed database application forms using MS Access in coordination with SQL tables and stored procedures.
- Served as the initial contact person for all SQL support queries.
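A minimal sketch of querying a BigQuery star schema from Python with the google-cloud-bigquery client, as referenced in the star-schema bullet above; the project, dataset, and table names are hypothetical.

```python
# Sketch: join a fact table to a dimension table in a BigQuery star schema.
# Project, dataset, and table names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project id

sql = """
    SELECT d.region, SUM(f.amount) AS total_amount
    FROM `my-gcp-project.sales.fact_orders` AS f
    JOIN `my-gcp-project.sales.dim_customer` AS d
      ON f.customer_id = d.customer_id
    GROUP BY d.region
    ORDER BY total_amount DESC
"""

for row in client.query(sql).result():  # submits the job and waits for the result
    print(row.region, row.total_amount)
```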
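A minimal sketch of the condition-based DataFrame policy logic referenced above, shown in PySpark for consistency with the other sketches (the production components were written in Scala); column names and thresholds are hypothetical.

```python
# Sketch: apply policy-style conditions to a Spark DataFrame with when/otherwise
# and filter using a Spark SQL array function. Column names and thresholds are
# hypothetical; the actual credit line rules were implemented in Scala.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("credit-line-policy-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, 720, 55000.0, ["card", "auto"]), (2, 610, 32000.0, ["card"])],
    ["customer_id", "credit_score", "income", "products"],
)

policy = (
    df.withColumn(
        "credit_line",
        F.when((F.col("credit_score") >= 700) & (F.col("income") > 50000), 20000)
         .when(F.col("credit_score") >= 650, 10000)
         .otherwise(5000),
    )
    .filter(F.array_contains(F.col("products"), "card"))  # keep card customers only
)

policy.show()
```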
Environment: Spark, Scala, AWS, Python, Spark SQL, Redshift, PostgreSQL, Databricks, Jupyter, Kafka
Sr. Data Engineer
Confidential, Fort Lauderdale, FL
Responsibilities:
- Worked on requirements gathering, analysis, and design of the systems.
- Developed Spark programs using Scala to compare the performance of Spark with Hive and Spark SQL.
- Developed a Spark Streaming application to consume JSON messages from Kafka and perform transformations (see the streaming sketch after this list).
- Used Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Involved in developing a MapReduce framework that filters bad and unnecessary records.
- Ingested data from RDBMS sources, performed data transformations, and then exported the transformed data to Cassandra per the business requirement.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs with Scala.
- Used Spark API over Hadoop YARN as execution engine for data analytics using Hive.
- Exported the analyzed data to the relational databases using Sqoop to further visualize and generate reports for the BI team.
- Migrated the computational code from HQL to PySpark.
- Worked with the Spark ecosystem using Scala and Hive queries on different data formats such as text files and Parquet.
- Worked on migrating HiveQL into Impala to minimize query response time.
- Responsible for migrating the code base to Amazon EMR and evaluating Amazon ecosystem components such as Redshift.
- Collected log data from web servers and integrated it into HDFS using Flume.
- Developed Python scripts to clean the raw data.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Used AWS services such as EC2 and S3 for processing and storing small data sets.
- Implemented Nifi flow topologies to perform cleansing operations before moving data into HDFS.
- Worked on different file formats (ORC, Parquet, Avro) and different compression codecs (GZIP, Snappy, LZO).
- Created Kafka applications that monitor consumer lag within Apache Kafka clusters.
- Worked on importing and exporting data into HDFS and Hive using Sqoop, and built analytics on Hive tables using HiveContext in Spark jobs.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS.
- Worked in Agile environment using Scrum methodology.
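A minimal PySpark Structured Streaming sketch of consuming JSON messages from Kafka and applying a transformation, as referenced above; the broker address, topic, schema, and output paths are hypothetical, and the job assumes the spark-sql-kafka connector is available.

```python
# Sketch: consume JSON messages from Kafka, parse them against a schema,
# filter, and write the stream out. Broker, topic, schema, and paths are
# hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-json-stream-sketch").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
    .option("subscribe", "events")                      # placeholder topic
    .load()
)

events = (
    raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
    .filter(F.col("event_type") == "purchase")          # example transformation
)

query = (
    events.writeStream.format("parquet")
    .option("path", "/data/events/purchases")           # placeholder output path
    .option("checkpointLocation", "/checkpoints/purchases")
    .start()
)
query.awaitTermination()
```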
Environment: Hadoop, Hive, MapReduce, Sqoop, Kafka, Spark, YARN, Pig, PySpark, Cassandra, Oozie, NiFi, Solr, Shell Scripting, HBase, Scala, AWS, Maven, Java, JUnit, Agile methodologies, Hortonworks, SOAP, Python, Teradata, MySQL.
Data Engineer
Confidential
Responsibilities:
- Worked on requirements gathering, analysis, and design of the systems.
- Actively involved in designing the Hadoop ecosystem pipeline.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Involved in designing a multi-data-center Kafka cluster and monitoring it.
- Responsible for importing real-time data from sources into Kafka clusters.
- Worked with Spark tuning techniques such as refreshing tables, handling parallelism, and modifying Spark defaults for performance tuning.
- Implemented Spark RDD transformations to map business analysis logic and applied actions on top of the transformations.
- Involved in migrating MapReduce jobs to Spark jobs and used Spark SQL and the DataFrames API to load structured data into Spark clusters.
- Involved in using the Spark API over Hadoop YARN as the execution engine for data analytics with Hive, and submitted the data to the BI team for generating reports after processing and analyzing it in Spark SQL.
- Performed SQL Joins among Hive tables to get input for Spark batch process.
- Worked with the data science team to build statistical models with Spark MLlib and PySpark.
- Involved in importing data from various sources into the Cassandra cluster using Sqoop.
- Worked on creating Cassandra data models from the existing Oracle data model.
- Designed column families in Cassandra, ingested data from RDBMS sources, performed data transformations, and then exported the transformed data to Cassandra per the business requirement.
- Used Sqoop import functionality to load historical data from RDBMS into HDFS.
- Designed workflows and coordinators in Oozie to automate and parallelize Hive jobs on the Apache Hadoop environment from Hortonworks (HDP 2.2).
- Configured Hive bolts and wrote data to Hive in Hortonworks as part of a POC.
- Implemented the ELK (Elasticsearch, Logstash, Kibana) stack to collect and analyze the logs produced by the Spark cluster.
- Developed Python scripts to start and end jobs smoothly within a UC4 workflow.
- Developed Oozie workflow for scheduling & orchestrating the ETL process.
- Created data pipelines per the business requirements and scheduled them using Oozie coordinators.
- Wrote Python scripts to parse XML documents and load the data into the database (see the parsing sketch after this list).
- Worked extensively with Apache NiFi to build flows for the existing Oozie jobs to handle incremental loads, full loads, and semi-structured data, to pull data from REST APIs into Hadoop, and to automate all NiFi flows to run incrementally.
- Created NiFi flows to trigger Spark jobs and used PutEmail processors to send notifications on failures.
- Developed shell scripts to periodically perform incremental imports of data from third-party APIs into AWS.
- Worked extensively on importing metadata into Hive using Scala and migrated existing tables and applications to Hive and the AWS cloud.
- Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations in Scala using the Spark framework.
- Used version control tools such as GitHub to share code snippets among team members.
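A minimal sketch of the kind of Python script used to parse XML documents and load records into a database, as referenced above; the XML layout, table name, and SQLite target are hypothetical stand-ins for the actual source and database.

```python
# Sketch: parse an XML document and load its records into a database table.
# The XML structure, table name, and SQLite target are hypothetical placeholders.
import sqlite3
import xml.etree.ElementTree as ET


def load_orders(xml_path: str, db_path: str) -> int:
    """Parse <order> elements from the XML file and insert them into the DB."""
    tree = ET.parse(xml_path)
    rows = [
        (order.get("id"), order.findtext("customer"), float(order.findtext("amount")))
        for order in tree.getroot().iter("order")
    ]

    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders (id TEXT, customer TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.close()
    return len(rows)


if __name__ == "__main__":
    print(load_orders("orders.xml", "orders.db"))
```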
Environment: Hadoop, HDFS, Hive, Python, HBase, NiFi, Spark, MySQL, Oracle 12c, Linux, Hortonworks, Oozie, MapReduce, Sqoop, Shell Scripting, Apache Kafka, Scala, AWS.
Data Engineer
Confidential
Responsibilities:
- Analyzed functional specifications based on project requirements.
- Ingested data from various data sources into Hadoop HDFS/Hive tables using Sqoop, Flume, and Kafka.
- Extended Hive core functionality by writing custom UDFs using Java.
- Developed Hive queries for user requirements.
- Worked on multiple POCs implementing a data lake for multiple data sources, ranging from Teamcenter, SAP, and Workday to machine logs.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Worked on MS SQL Server PDW migration for the MSBI warehouse.
- Planned, scheduled, and implemented Oracle-to-MS SQL Server migrations for AMAT in-house applications and tools.
- Worked on the Solr search engine to index incident report data and developed dashboards in the Banana reporting tool.
- Integrated Tableau with the Hadoop data source to build dashboards providing various insights into the organization's sales.
- Worked on Spark in building BI reports using Tableau. Tableau was integrated with Spark using Spark-SQL.
- Developed Spark jobs using Scala and Python on top of YARN/MRv2 for interactive and batch analysis (see the batch sketch after this list).
- Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
- Developed workflows in LiveCompare to analyze SAP data and reporting.
- Worked on Java development and support, and on tools support, for in-house applications.
- Participated in daily scrum meetings and iterative development.
- Built search functionality for searching through millions of files for logistics groups.
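A minimal PySpark sketch of the kind of batch analysis job run on YARN, aggregating a Hive table for downstream BI reporting; the database, table, and column names are hypothetical.

```python
# Sketch: batch analysis job (submitted e.g. with spark-submit --master yarn)
# that aggregates a Hive table. Database, table, and column names are
# hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("sales-batch-analysis-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

daily_sales = (
    spark.table("sales_db.transactions")                # placeholder Hive table
    .groupBy("sale_date", "product_category")
    .agg(
        F.sum("amount").alias("total_amount"),
        F.countDistinct("customer_id").alias("customers"),
    )
)

# Persist the summary back to Hive for BI/Tableau reporting.
daily_sales.write.mode("overwrite").saveAsTable("sales_db.daily_sales_summary")
```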
Hadoop Engineer
Confidential
Responsibilities:
- Developed highly optimized Spark applications to perform data cleansing, validation, transformation and summarization activities
- Built a data pipeline consisting of Spark, Hive, Sqoop, and custom-built input adapters to ingest, transform, and analyze operational data.
- Created Spark jobs and Hive Jobs to summarize and transform data.
- Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
- Converted Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
- Used different tools for data integration with different databases and Hadoop.
- Built real-time data pipelines by developing Kafka producers and Spark Streaming applications for consumption.
- Ingested syslog messages, parsed them, and streamed the data to Kafka (see the producer sketch after this list).
- Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the data into HDFS.
- Exported the analyzed data to the relational databases using Sqoop, to further visualize and generate reports for the BI team.
- Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
- Analyzed the data by performing Hive queries (HiveQL) to study customer behavior.
- Helped DevOps engineers deploy code and debug issues.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Developed Hive scripts in HiveQL to de-normalize and aggregate the data.
- Scheduled and executed workflows in Oozie to run various jobs.
- Implemented business logic in Hive and wrote UDFs to process the data for analysis.
- Addressed issues occurring due to the huge volume of data and transitions.
- Documented operational problems by following standards and procedures using JIRA.
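A minimal Python sketch of the syslog-to-Kafka producer pattern described above, using the kafka-python client; the broker address, topic name, input file, and the simplified syslog pattern are hypothetical.

```python
# Sketch: parse syslog lines and stream them to a Kafka topic with kafka-python.
# Broker address, topic name, input file, and the simplified syslog pattern are
# hypothetical placeholders.
import json
import re

from kafka import KafkaProducer

# Very simplified syslog layout: "<pri>Mon dd hh:mm:ss host message"
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d+)>(?P<ts>\w{3}\s+\d+\s[\d:]+)\s(?P<host>\S+)\s(?P<msg>.*)$"
)

producer = KafkaProducer(
    bootstrap_servers="broker1:9092",                   # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def stream_syslog(path: str, topic: str = "syslog-events") -> None:
    """Read a syslog file line by line and publish parsed records to Kafka."""
    with open(path) as f:
        for line in f:
            match = SYSLOG_RE.match(line.strip())
            if match:
                producer.send(topic, match.groupdict())
    producer.flush()


if __name__ == "__main__":
    stream_syslog("/var/log/syslog.sample")             # placeholder input file
```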
Environment: Hadoop, HDFS, Spark, Scala, Hive, Apache NiFi, Kafka, HBase, Oracle, MS SQL Server PDW, MapReduce, Oozie, Sqoop, Java.